Advanced Analytics with R (UG 21-24)
I am Ayush.
I am a researcher working at the intersection of data, law, development and economics.
I teach Data Science using R at Gokhale Institute of Politics and Economics.
I am an RStudio (Posit) certified tidyverse instructor.
I am a researcher at the Oxford Poverty and Human Development Initiative (OPHI), at the University of Oxford.
Reach me
ayush.ap58@gmail.com
ayush.patel@gipe.ac.in
Dip our toes into classification techniques. How to apply and assess these methods.
References for this lecture:
“… often the methods used for classification first predict the probability that the observation belongs to each of the categories of a qualitative variable, as the basis for making the classification. In this sense they also behave like regression methods.”
Default data

| default | student | balance | income |
|---|---|---|---|
| No | No | 729.5265 | 44361.625 |
| No | Yes | 817.1804 | 12106.135 |
| No | No | 1073.5492 | 31767.139 |
| No | No | 529.2506 | 35704.494 |
| No | No | 785.6559 | 38463.496 |
| No | Yes | 919.5885 | 7491.559 |
| No | No | 825.5133 | 24905.227 |
| No | Yes | 808.6675 | 17600.451 |
| No | No | 1161.0579 | 37468.529 |
| No | No | 0.0000 | 29275.268 |
| No | Yes | 0.0000 | 21871.073 |
| No | Yes | 1220.5838 | 13268.562 |
| No | No | 237.0451 | 28251.695 |
| No | No | 606.7423 | 44994.556 |
| No | No | 1112.9684 | 23810.174 |
| No | No | 286.2326 | 45042.413 |
| No | No | 0.0000 | 50265.312 |
| No | Yes | 527.5402 | 17636.540 |
| No | No | 485.9369 | 61566.106 |
| No | No | 1095.0727 | 26464.631 |
Default is our response (\(Y\)), coded Yes or No. I ran this: \(p(X) = \beta_0 + \beta_1X\), where \(X\) is balance.
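## load packages used in this lecture
library(ISLR)     # Default data (assumed source, same package as Credit)
library(dplyr)    # mutate()
library(ggplot2)  # plotting

## make a dummy for default: 1 if the person defaulted, 0 otherwise
Default |>
  mutate(
    default_dumm = ifelse(default == "Yes", 1, 0)
  ) -> def_dum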
## regress dummy over balance and plot
lm(default_dumm ~ balance, data = def_dum) |>
  broom::augment() |>
  ggplot(aes(balance, default_dumm)) +
  geom_point(alpha = 0.6) +
  geom_line(aes(balance, .fitted), colour = "red") +
  labs(
    title = "Linear regression fit to qualitative response",
    subtitle = "Yes = 1, No = 0",
    y = "prob default status"
  ) +
  theme_minimal() -> plot_linear
## Run the logistic regression
glm(
  default_dumm ~ balance,
  data = def_dum,
  family = binomial
) |>
  broom::augment(type.predict = "response") |>
  ggplot(aes(balance, default_dumm)) +
  geom_point(alpha = 0.6) +
  geom_line(aes(balance, .fitted), colour = "red") +
  labs(
    title = "Logistic regression fit to qualitative response",
    subtitle = "Yes = 1, No = 0",
    y = "prob default status"
  ) +
  theme_minimal() -> logistic_plot

We saw that some fitted values in the linear model were negative, which cannot be probabilities.
We need a function that returns values in the interval [0, 1].
\[p(X) = \frac{e^{\beta_0 + \beta_1X}}{1+e^{\beta_0 + \beta_1X}}\]
This is the logistic function. Its coefficients \(\beta_0\) and \(\beta_1\) are estimated by the maximum likelihood method.
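To see why this function suits probabilities, the short sketch below evaluates it at a few values of the linear predictor. The helper name `logistic()` is introduced here only for illustration; base R's `plogis()` computes the same quantity.

```r
# Logistic function: maps any real-valued linear predictor into (0, 1).
# `logistic` is an illustrative helper, not part of the lecture code.
logistic <- function(z) exp(z) / (1 + exp(z))

z <- c(-10, -1, 0, 1, 10)   # example values of beta_0 + beta_1 * X
p <- logistic(z)

# Every value lies strictly between 0 and 1, and z = 0 gives 0.5.
range(p)
all.equal(p, plogis(z))     # base R's plogis() returns the same values
```

However large or small the linear predictor gets, the output stays inside the unit interval, which is exactly what the straight-line fit could not guarantee.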
**Odds:**
\[\frac{p(X)}{1-p(X)} = e^{\beta_0 + \beta_1X}\]
**Log odds or logit:**
\[\log\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1X\]
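A quick numeric check of these definitions may help. The sketch below uses coefficients rounded from the balance-only fit reported in the table that follows (the rounding is an assumption made for readability) and confirms that the log odds are linear in \(X\), so each extra unit of \(X\) multiplies the odds by \(e^{\beta_1}\).

```r
# Coefficients rounded from the balance-only model (illustrative values).
b0 <- -10.65
b1 <- 0.0055

balance <- c(1000, 1001)             # two balances one dollar apart
p       <- plogis(b0 + b1 * balance) # probabilities from the logistic function

odds     <- p / (1 - p)
log_odds <- log(odds)                # same as qlogis(p)

# Log odds rise by b1 per unit of balance; odds are multiplied by exp(b1).
diff(log_odds)
odds[2] / odds[1]
exp(b1)
```

This is why logistic coefficients are usually read on the log-odds scale: the effect of \(X\) on the probability itself depends on where you start, but its effect on the log odds is constant.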
If the following are the results of the model \(\text{logit}(p(\text{default})) = \beta_0 + \beta_1 \cdot \text{balance}\):
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -10.651330614 | 0.3611573721 | -29.49221 | 3.623124e-191 |
| balance | 0.005498917 | 0.0002203702 | 24.95309 | 1.976602e-137 |
What is the probability of default for a balance of $5000?
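One way to work this out is to plug the estimates from the table into the logistic function. The sketch below does only that; the variable names are illustrative.

```r
# Estimates copied from the table above.
b0 <- -10.651330614   # (Intercept)
b1 <- 0.005498917     # balance

balance <- 5000
eta <- b0 + b1 * balance           # log odds of default, about 16.84
p   <- exp(eta) / (1 + exp(eta))   # same as plogis(eta)

eta
p
```

The log odds are large and positive, so the predicted probability is very close to 1. Note that $5000 lies far above the balances shown in the data excerpt, so this is an extrapolation of the fitted curve.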
\[p(X) = \frac{e^{\beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_nX_n}}{1+e^{\beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_nX_n}}\]
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -1.086905e+01 | 4.922555e-01 | -22.080088 | 4.911280e-108 |
| income | 3.033450e-06 | 8.202615e-06 | 0.369815 | 7.115203e-01 |
| balance | 5.736505e-03 | 2.318945e-04 | 24.737563 | 4.219578e-135 |
| studentYes | -6.467758e-01 | 2.362525e-01 | -2.737646 | 6.188063e-03 |
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -3.5041278 | 0.07071301 | -49.554219 | 0.0000000000 |
| studentYes | 0.4048871 | 0.11501883 | 3.520181 | 0.0004312529 |
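The term names suggest that the first table comes from a model with income, balance, and student together, and the second from a model with student alone. The exact calls are not shown, so the formulas in the sketch below are an assumption; it also assumes the {ISLR} and {broom} packages are installed.

```r
library(ISLR)   # provides the Default data (assumed installed)

# Model with all three predictors (assumed to match the first table).
fit_multi <- glm(default ~ income + balance + student,
                 data = Default, family = binomial)

# Model with student status only (assumed to match the second table).
fit_student <- glm(default ~ student,
                   data = Default, family = binomial)

broom::tidy(fit_multi)
broom::tidy(fit_student)

# The student coefficient changes sign between the two fits.
coef(fit_multi)["studentYes"]
coef(fit_student)["studentYes"]
```

The sign change is the point to notice. On its own, being a student is associated with higher odds of default, but once balance and income are held fixed the student coefficient turns negative.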
There is no consensus in the statistics community on a single measure that describes goodness of fit for logistic regression.
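In practice, several complementary summaries are usually reported side by side. The sketch below shows two common ones for the balance-only model: likelihood-based summaries (deviance and AIC) and a confusion matrix at a 0.5 cut-off. The cut-off and the choice of summaries are illustrative assumptions, not a recommendation from this lecture, and the {ISLR} package is assumed to be installed.

```r
library(ISLR)   # provides the Default data (assumed installed)

fit <- glm(default ~ balance, data = Default, family = binomial)

# Likelihood-based summaries: lower residual deviance / AIC indicates a
# better fit when comparing models fitted to the same data.
c(null_deviance     = fit$null.deviance,
  residual_deviance = deviance(fit),
  AIC               = AIC(fit))

# Confusion matrix at an (arbitrary) 0.5 probability cut-off.
pred_prob  <- predict(fit, type = "response")
pred_class <- ifelse(pred_prob > 0.5, "Yes", "No")
conf_mat   <- table(predicted = pred_class, actual = Default$default)
conf_mat

# Share of observations classified correctly.
accuracy <- mean(pred_class == Default$default)
accuracy
```

Accuracy alone can flatter a model when one class is rare, as defaults are in the excerpt shown earlier, so it is worth reading the confusion matrix row by row rather than relying on a single number.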
Use the Credit data in {ISLR}.
What you just did is called the stratified approach to multinomial logistic regression.